Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# Hi IODS, I am eager to learn about open data, open science and data science. I expect to improve my skills in data analysis and interpretation, and I hope to learn new things. This course was highly recommended by a colleague who attended it in the past, so I am excited and ready to learn.

date()
## [1] "Fri Dec  2 13:44:03 2022"

My experience with exercise set 1

# I am familiar with R, but I found some information and suggestions in the book that might be useful. I usually save the R files and folders for a particular project in a designated project folder, but I have never used an RStudio project. It might be useful to try one and see how it differs from my approach. I am also familiar with most of the contents of chapters 1-4. However, it was fun to run the code and discover ways to improve my code and plots.

date()
## [1] "Fri Dec  2 13:44:03 2022"

ASSIGNMENT 2: REGRESSION AND MODEL VALIDATION

Describe the work you have done this week and summarize your learning.

date()
## [1] "Fri Dec  2 13:44:03 2022"

This week I studied chapter 7, "Linear regression", from the book R for Health Data Science, available here: https://argoshare.is.ed.ac.uk/healthyr_book/chap07-h1.html. I am also learning data wrangling, i.e. how to manage large data tables and create small, manageable working tables. After that, I practise regression analysis and model validation with univariate and multivariate regression models.

Data wrangling (max 5 points)

#The code for this exercise is available inside the data folder along with the files it creates:
#create_learning2014.R
#assignment2_regression_and_model_validation_data_wrangling.csv

Analysis (max 15 points)

Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt . (The separator is a comma “,” and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)

Learning the dataset

This dataset comes from a survey conducted during the study “The relationship between learning approaches and students’ achievements in an introductory statistics course in Finland, 2015”. The results of the study are available here: https://www.slideshare.net/kimmovehkalahti/the-relationship-between-learning-approaches-and-students-achievements-in-an-introductory-statistics-course-in-finland

learning2014 <- read.csv("data/assignment2_regression_and_model_validation_data_wrangling.csv")
dim(learning2014)
## [1] 166   7

The dataset has 166 rows and 7 columns. I am using a shorter version of the dataset produced by data wrangling. Here is the structure of the dataset:

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

Each column is described below:

- gender: categorical variable, Male (M) or Female (F).
- Age: age in years, derived from the date of birth.
- Attitude: numerical variable representing global attitude toward statistics.
- deep: numerical variable for the deep approach, calculated as the mean of several survey items on seeking meaning, relating ideas and use of evidence.
- stra: numerical variable for the strategic approach, calculated as the mean of several survey items on organized studying and time management.
- surf: numerical variable for the surface approach, calculated as the mean of several survey items on lack of purpose, unrelated memorising and syllabus-boundness.
- Points: integer variable representing exam points.

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Graphical overview

Here is a summary table of the above dataset:

summary(learning2014)
##     gender               Age           Attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

In this study female participants were almost double the number of male participants. Based on the summary, the deep, surface and strategic scores and the exam points are approximately normally distributed, as their mean and median values are very close. The age distribution is skewed to the right: most participants are young (interquartile range 21-27 years), with a few older students (maximum 55 years old).
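The mean-median comparison behind these claims can be checked directly. A minimal base-R sketch, assuming learning2014 is loaded as in the chunk above:

```r
# A mean well above the median suggests a right-skewed (positively
# skewed) distribution; mean close to median suggests rough symmetry.
skew_check <- function(x) c(mean = mean(x), median = median(x))

skew_check(learning2014$Age)     # mean 25.51 > median 22: right skew
skew_check(learning2014$Points)  # mean 22.72, median 23: roughly symmetric
```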

Here is a graphical overview from the dataset for better understanding of the relationship of available variables:

library(GGally) 
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
# Draw the plot 
p <- ggpairs(learning2014, mapping = aes(col = gender), 
                     lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
p

Male students tend to report a more positive attitude towards statistics. There was almost no correlation between age and the other variables. For male students, the surface approach correlates significantly negatively with both attitude and the deep approach scores. There was a significant positive correlation between attitude and points scored for both genders. Exam points also correlate negatively with the surface approach in both genders, although the results were not significant.
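The pairwise correlations read off the ggpairs panels can also be computed numerically; a sketch using base R only:

```r
# Overall correlation between attitude and exam points
cor(learning2014$Attitude, learning2014$Points)

# cor.test() adds a p-value for the null hypothesis of zero correlation
cor.test(learning2014$Attitude, learning2014$Points)

# The same correlation split by gender
by(learning2014[, c("Attitude", "Points")], learning2014$gender,
   function(d) cor(d$Attitude, d$Points))
```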

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)

Regression model

# fit a linear model
my_model <- lm(Points ~ Attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In this multiple regression model the only explanatory variable with a clear effect on the student’s final exam result is Attitude (p < 0.001). There is no significant relationship between the strategic or surface approach and the points scored. The multiple R-squared of 0.2074 implies that the model explains about 20%, or one-fifth, of the variation in the outcome.
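One way to confirm that dropping the non-significant variables does not meaningfully worsen the fit is a nested-model F-test; a sketch:

```r
full    <- lm(Points ~ Attitude + stra + surf, data = learning2014)
reduced <- lm(Points ~ Attitude, data = learning2014)

# anova() on nested models performs an F-test of the dropped terms;
# a large p-value means stra and surf add little explanatory power.
anova(reduced, full)
```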

# fit a linear model with the variable that has a statistically significant relationship with Points, i.e. Attitude
my_model <- lm(Points ~ Attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)

The target variable ‘Points’ has a statistically significant relationship with the explanatory variable ‘Attitude’ (p < 0.001). At a (theoretical) attitude value of 0, the predicted score would be 11.64 points, with a standard error of 1.83. For every one-point increase in attitude, the expected exam score increases by 3.53 points, with a standard error of 0.57.

R-squared is always between 0 and 1: 0 (representing 0%) indicates that the model explains none of the variability of the response data around its mean, and 1 (representing 100%) indicates that it explains all of it. Multiple R-squared is used to evaluate how well the model fits the data. In this case, the multiple R-squared of 0.1906 implies that the model explains only about 19% of the variation in the outcome.
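The multiple R-squared can be reproduced by hand from the residual and total sums of squares, which makes its "share of explained variation" interpretation concrete:

```r
fit <- lm(Points ~ Attitude, data = learning2014)

rss <- sum(residuals(fit)^2)                                     # residual sum of squares
tss <- sum((learning2014$Points - mean(learning2014$Points))^2)  # total sum of squares

1 - rss / tss   # matches the Multiple R-squared reported by summary(fit)
```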

This is how the model looks

library(ggplot2)
qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
## `geom_smooth()` using formula 'y ~ x'

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)

Diagnostic plots

#R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the `plot()` function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page of plotting an lm object by typing `?plot.lm` or `help(plot.lm)` to the R console. 

#In the plot function you can then use the argument `which` to choose which plots you want. `which` must be an integer vector corresponding to the following list of plots:
#which | graphic                                 
#----- | --------
#1     | Residuals vs Fitted values 
#2     | Normal QQ-plot
#3     | Standardized residuals vs Fitted values
#4     | Cook's distances
#5     | Residuals vs Leverage 
#6     | Cook's distance vs Leverage
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#Residuals vs Fitted values
plot(my_model, which = 1)

The Residuals vs Fitted plot examines the variance of the residuals. A residual is the difference between an observed value and the value predicted by the fitted model, so the plot shows how far each observation deviates from the model’s estimate.

Here, the linearity assumption of the regression model is supported by homoscedasticity, i.e. homogeneity of variances: the spread of the residuals should be similar across the whole range of fitted values. In this model the residuals mostly fall within about ±10 points of the fitted values. There are three exceptional observations: 145, 56 and 35.
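Homoscedasticity can also be probed numerically. A rough base-R check, under the assumption that non-constant variance would show up as |residuals| growing with the fitted values:

```r
# If the residual variance grows with the fitted values, the absolute
# residuals will correlate with them; a near-zero correlation is
# consistent with constant variance.
cor.test(abs(residuals(my_model)), fitted(my_model))
```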

#Normal QQ-plot
plot(my_model, which = 2)

A Q-Q (quantile-quantile) plot is a probability plot used to compare two distributions by plotting their quantiles against each other. Here, the residuals of the fitted model are compared against a theoretical normal distribution. The residuals appear approximately normally distributed. There are some deviating observations at the extremes of the distribution, which are expected to have very little impact on the model. Observations 145, 56 and 35 again stand out as poorly fitted by the model.
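The visual impression of normality can be complemented with a formal test on the residuals; a sketch using the Shapiro-Wilk test from base R:

```r
# Shapiro-Wilk test: the null hypothesis is that the residuals are
# normally distributed, so a large p-value supports the QQ-plot reading.
shapiro.test(residuals(my_model))
```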

#Residuals vs Leverage
plot(my_model, which = 5)

Here we plotted the standardized residuals against the leverage of the fitted model. Leverage measures how far the explanatory-variable values of an observation are from those of the other observations, so this plot shows the impact of each observation on the fitted model. A few observations have a high impact, suggesting that parts of the fit are driven by a few data points. Observations 56 and 35, together with a new observation 71, are influential, whereas 145 does not have particularly high leverage.
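The influential observations flagged in the plot can also be listed directly via leverage and Cook's distance, both available in base R:

```r
# Leverage (hat values) and Cook's distances for every observation
lev  <- hatvalues(my_model)
cook <- cooks.distance(my_model)

# Observations with the largest influence on the fit, most extreme first
head(sort(cook, decreasing = TRUE))
head(sort(lev,  decreasing = TRUE))
```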

After completing all the phases above you are ready to submit your assignment for review (using the Moodle Workshop below). Have the two links (your GitHub repository and your course diary) ready!

The link to my GitHub repository: https://github.com/adhisadi/IODS-project

The link to my course diary: https://adhisadi.github.io/IODS-project/

ASSIGNMENT 3: LOGISTIC REGRESSION

date()
## [1] "Fri Dec  2 13:44:13 2022"

Read the joined student alcohol consumption data into R either from your local folder (if you completed the Data wrangling part) or from this url (in case you got stuck with the Data wrangling part) https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv (In the above linked file, the column separator is a comma and the first row includes the column names). Print out the names of the variables in the data and describe the data set briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-1 point)

Load the dataset

I created the file in the data wrangling part, but I read the data both from my local file and from the online source, just to check that they match.

library(readr)
setwd("/Users/admin_adhisadi/workspace/School_work/IODS-project")
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
alc2 <- read.csv("data/math_por.csv", sep = ",", header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
colnames(alc2)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
str(alc)
## spec_tbl_df [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:370] "F" "F" "F" "F" ...
##  $ age       : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr [1:370] "U" "U" "U" "U" ...
##  $ famsize   : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr [1:370] "A" "T" "T" "T" ...
##  $ Medu      : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr [1:370] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr [1:370] "teacher" "other" "other" "services" ...
##  $ reason    : chr [1:370] "course" "course" "other" "home" ...
##  $ guardian  : chr [1:370] "mother" "father" "mother" "mother" ...
##  $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
##  $ famsup    : chr [1:370] "no" "yes" "no" "yes" ...
##  $ activities: chr [1:370] "no" "no" "no" "yes" ...
##  $ nursery   : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:370] "no" "yes" "yes" "yes" ...
##  $ romantic  : chr [1:370] "no" "no" "no" "yes" ...
##  $ famrel    : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr [1:370] "no" "no" "yes" "yes" ...
##  $ absences  : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   paid = col_character(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   alc_use = col_double(),
##   ..   high_use = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(alc2)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
# Analysis done using one of the datasets

Student performance dataset:

In week 3 we work with a new dataset about student performance. The data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks. More information is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance. The ‘alc_use’ column is the average of the weekday and weekend alcohol consumption (Dalc and Walc), and the ‘high_use’ column is derived from ‘alc_use’: TRUE for students whose ‘alc_use’ is greater than 2 (and FALSE otherwise).
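The derived columns can be sketched as follows. This is a hypothetical reconstruction of that wrangling step, not the actual script (which lives in the data folder):

```r
library(dplyr)

# alc_use: average of weekday (Dalc) and weekend (Walc) consumption;
# high_use: TRUE when that average exceeds 2.
alc <- alc %>%
  mutate(alc_use  = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
```

For example, a student with Dalc = 1 and Walc = 3 gets alc_use = 2 and high_use = FALSE.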


The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

Hypothesis

With this dataset we are essentially studying the relationship between students’ behaviour and alcohol consumption. I chose these 4 variables: famsize, Pstatus, activities and absences. Below are descriptions of the columns and my hypotheses, which are mostly based on stereotypes.

famsize - family size (binary: ‘LE3’ - less or equal to 3, or ‘GT3’ - greater than 3).
A big family could mean an extended family with uncles, aunts and grandparents, or more children in the household, depending on the culture. I assume here that a big family means more children. My hypothesis is therefore that alcohol consumption may be higher in larger families, possibly because it is harder for the parents to meet the financial and emotional needs of each child.

Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart). My hypothesis is that students raised by a single parent have increased alcohol consumption, for similar reasons as above: difficulty in meeting the financial and emotional needs of the child.

activities - extra-curricular activities (binary: yes or no). More extra-curricular activities mean an active, busy life and more social support, so I expect a lower likelihood of high alcohol consumption.

absences - number of school absences (numeric: from 0 to 93). A high number of school absences may indicate several problems in the student’s life, so I expect an increased likelihood of high alcohol consumption. In addition, all of the above factors (larger family size, parents living apart and few extra-curricular activities) are likely to contribute to higher absence counts together with increased alcohol consumption.

Plots

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gmodels)
library(ggplot2)
CrossTable(alc$high_use, alc$famsize, dnn = c("High alcohol consumption", "Family size"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##                          | Family size 
## High alcohol consumption |       GT3 |       LE3 | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                    FALSE |       192 |        67 |       259 | 
##                          |     0.181 |     0.462 |           | 
##                          |     0.741 |     0.259 |     0.700 | 
##                          |     0.722 |     0.644 |           | 
##                          |     0.519 |     0.181 |           | 
## -------------------------|-----------|-----------|-----------|
##                     TRUE |        74 |        37 |       111 | 
##                          |     0.422 |     1.078 |           | 
##                          |     0.667 |     0.333 |     0.300 | 
##                          |     0.278 |     0.356 |           | 
##                          |     0.200 |     0.100 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |       266 |       104 |       370 | 
##                          |     0.719 |     0.281 |           | 
## -------------------------|-----------|-----------|-----------|
## 
## 
p1 <- alc %>% 
  ggplot(aes(x = famsize, fill = high_use)) + 
  geom_bar()

p2 <- alc %>% 
  ggplot(aes(x = famsize, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

library(patchwork)
p1 + p2

Large family size does not appear to be associated with high alcohol consumption.
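The impression from the cross-tabulation can be tested formally with a chi-square test of independence; a base-R sketch:

```r
# Null hypothesis: high alcohol use is independent of family size.
# A large p-value means the data do not contradict independence.
chisq.test(table(alc$high_use, alc$famsize))
```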

CrossTable(alc$high_use, alc$Pstatus, dnn = c("High alcohol consumption", "Parental status"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##                          | Parental status 
## High alcohol consumption |         A |         T | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                    FALSE |        26 |       233 |       259 | 
##                          |     0.014 |     0.002 |           | 
##                          |     0.100 |     0.900 |     0.700 | 
##                          |     0.684 |     0.702 |           | 
##                          |     0.070 |     0.630 |           | 
## -------------------------|-----------|-----------|-----------|
##                     TRUE |        12 |        99 |       111 | 
##                          |     0.032 |     0.004 |           | 
##                          |     0.108 |     0.892 |     0.300 | 
##                          |     0.316 |     0.298 |           | 
##                          |     0.032 |     0.268 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |        38 |       332 |       370 | 
##                          |     0.103 |     0.897 |           | 
## -------------------------|-----------|-----------|-----------|
## 
## 
p1 <- alc %>% 
  ggplot(aes(x = Pstatus, fill = high_use)) + 
  geom_bar()

p2 <- alc %>% 
  ggplot(aes(x = Pstatus, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")
p1 + p2

Parents’ cohabitation status does not seem to be associated with alcohol consumption.

CrossTable(alc$high_use, alc$activities, dnn = c("High alcohol consumption", "Extra-curricular activities"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##                          | Extra-curricular activities 
## High alcohol consumption |        no |       yes | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                    FALSE |       120 |       139 |       259 | 
##                          |     0.224 |     0.210 |           | 
##                          |     0.463 |     0.537 |     0.700 | 
##                          |     0.670 |     0.728 |           | 
##                          |     0.324 |     0.376 |           | 
## -------------------------|-----------|-----------|-----------|
##                     TRUE |        59 |        52 |       111 | 
##                          |     0.523 |     0.490 |           | 
##                          |     0.532 |     0.468 |     0.300 | 
##                          |     0.330 |     0.272 |           | 
##                          |     0.159 |     0.141 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |       179 |       191 |       370 | 
##                          |     0.484 |     0.516 |           | 
## -------------------------|-----------|-----------|-----------|
## 
## 
p1 <- alc %>% 
  ggplot(aes(x = activities, fill = high_use)) + 
  geom_bar()

p2 <- alc %>% 
  ggplot(aes(x = activities, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")
p1 + p2

Extra-curricular activities do not seem to affect alcohol consumption.

library(ggplot2)
#let's plot absences as it is numeric
p1 <- ggplot(alc, aes(x= absences, fill=high_use)) + geom_density(alpha=0.4)
p1+ theme_bw()

Logistic regression model

#model
m <- glm(high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus == "A") + absences, data = alc, family = "binomial")
#odds ratio
OR <- coef(m) %>% exp
#confidence interval
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ (alc$famsize == "GT3") + (activities == 
##     "no") + (Pstatus == "A") + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2960  -0.8113  -0.7217   1.2363   1.8858  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.1310     0.2708  -4.177 2.96e-05 ***
## alc$famsize == "GT3"TRUE  -0.3660     0.2568  -1.425  0.15404    
## activities == "no"TRUE     0.2847     0.2350   1.211  0.22575    
## Pstatus == "A"TRUE        -0.2759     0.4029  -0.685  0.49336    
## absences                   0.0900     0.0234   3.846  0.00012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 430.89  on 365  degrees of freedom
## AIC: 440.89
## 
## Number of Fisher Scoring iterations: 4
coef(m)
##              (Intercept) alc$famsize == "GT3"TRUE   activities == "no"TRUE 
##              -1.13102023              -0.36599660               0.28471938 
##       Pstatus == "A"TRUE                 absences 
##              -0.27594749               0.08999505
#odds ratios with their confidence intervals
cbind(OR, CI)
##                                 OR     2.5 %    97.5 %
## (Intercept)              0.3227039 0.1869931 0.5422024
## alc$famsize == "GT3"TRUE 0.6935052 0.4203051 1.1526892
## activities == "no"TRUE   1.3293889 0.8391496 2.1118619
## Pstatus == "A"TRUE       0.7588528 0.3313845 1.6273877
## absences                 1.0941689 1.0473341 1.1480407

The results show that larger family size, parents living apart, and lack of extra-curricular activities are not significantly associated with increased alcohol consumption: their odds ratios are close to 1 and their confidence intervals span 1 (0.42 to 1.15 for family size, 0.84 to 2.11 for activities, and 0.33 to 1.63 for parental status). For absences, OR = 1.09 means that each additional absence multiplies the odds of high alcohol consumption by about 1.09, and its confidence interval (1.05 to 1.15) excludes 1. My hypothesis was not supported.
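As a sanity check on this interpretation, a logistic regression coefficient can be converted to an odds ratio by exponentiation. A minimal sketch, using the absences coefficient printed by coef(m) above:

```r
# exponentiating a logistic regression coefficient gives an odds ratio
beta_absences <- 0.08999505   # log-odds coefficient for absences, from coef(m) above
exp(beta_absences)            # ~1.094: one extra absence multiplies the odds by about 1.09
exp(10 * beta_absences)       # ~2.46: ten extra absences multiply the odds by about 2.5
```

The multiplicative scale is why even a modest-looking coefficient matters for students with many absences.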

Predictive power and graphical presentation

# predictions versus actual values
# predict() the probabilty of high_use
probability <- predict(m, type = "response")
# add the predicted probability to 'alc'
alc <- mutate(alc, probability = probability)
# use probability to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5 )
# tabulate target variable versus predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% 
prop.table() %>% 
addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67837838 0.02162162 0.70000000
##    TRUE  0.26756757 0.03243243 0.30000000
##    Sum   0.94594595 0.05405405 1.00000000

This shows that my hypothesis was bad; the common negative stereotypes I mentioned in it should be avoided. The only variable that works in my model is absences, which is examined below:
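The table above can also be summarised as a single training error number. A minimal sketch (this loss function is not part of the output above; it assumes the alc$probability column created earlier):

```r
# share of inaccurately classified individuals:
# a prediction is wrong when the predicted probability is on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# ~0.29, i.e. the sum of the off-diagonal proportions in the table above
loss_func(class = alc$high_use, prob = alc$probability)
```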

New model

# new model
m2 <- glm(high_use ~ absences, data = alc, family = "binomial")
# Odds ratios
OR2 <- coef(m2) %>% exp
# confidence intervals
CI2 <- confint(m2) %>% exp
## Waiting for profiling to be done...
#probability of high alcohol consumption
probability_2 <- predict(m2, type = "response")
# add predicted probability to 'alc'
alc <- mutate(alc, probability_2 = probability_2)
# use probability to make a prediction of high_use
alc <- mutate(alc, prediction_2 = probability_2 > 0.5 )
# Summary of the model
summary(m2)
## 
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3598  -0.8209  -0.7318   1.2658   1.7419  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.26945    0.16094  -7.888 3.08e-15 ***
## absences     0.08867    0.02317   3.827  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 434.52  on 368  degrees of freedom
## AIC: 438.52
## 
## Number of Fisher Scoring iterations: 4
# Coefficients of the model
coef(m2)
## (Intercept)    absences 
## -1.26944532  0.08866504
# Odds ratios with their confidence intervals
cbind(OR2, CI2)
##                   OR2     2.5 %    97.5 %
## (Intercept) 0.2809874 0.2030806 0.3819961
## absences    1.0927146 1.0464586 1.1459894

The results show that students who are more absent from their classes are more likely to consume excess alcohol. Let’s see how these two models compare with a likelihood ratio test:

anova(m, m2, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus == 
##     "A") + absences
## Model 2: high_use ~ absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     430.89                     
## 2       368     434.52 -3  -3.6311   0.3042

The p-value (0.30) indicates no significant difference between the two models, so the simpler absences-only model is preferred; it also has the lower AIC (438.5 vs 440.9).

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability_2, y = high_use, col= prediction_2)) + xlab("probability")
# define the geom as points and draw the plot
g + geom_point() 

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction_2) %>% 
prop.table() %>% 
addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68648649 0.01351351 0.70000000
##    TRUE  0.27567568 0.02432432 0.30000000
##    Sum   0.96216216 0.03783784 1.00000000
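Side by side, the two confusion matrices give essentially the same training error. A minimal sketch computing both from the prediction columns created above:

```r
# share of misclassified students under each model
err_m  <- mean(alc$high_use != alc$prediction)    # full model
err_m2 <- mean(alc$high_use != alc$prediction_2)  # absences-only model
c(err_m, err_m2)                                  # both ~0.289, matching the tables above
```

So dropping the three non-significant predictors costs essentially nothing in predictive accuracy.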

ASSIGNMENT 4: CLUSTERING AND CLASSIFICATION

date()
## [1] "Fri Dec  2 13:44:16 2022"

Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’. Then include the file as a child file in your ‘index.Rmd’ file. Perform the following analysis in the file you created. Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. (0-1 points) https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Load the data

library(dplyr)
library(ggplot2)
library(GGally)
library(reshape)
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset consists of housing values in the suburbs of Boston. It has 506 rows and 14 columns. Here is a description of each column:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

Graphical overview

cor_matrix <- cor(Boston) 
cor_matrix <- cor_matrix %>% round(digits = 2)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

library("psych")  #for p-value table
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
cor_test_mat <- corr.test(cor_matrix)$p    # Apply corr.test function (note: this tests the columns of the correlation matrix itself, n = 14; corr.test(Boston) would test the raw data)
cor_test_mat 
##                 crim           zn        indus      chas          nox
## crim    0.000000e+00 5.460004e-02 5.861585e-03 1.0000000 7.907814e-03
## zn      2.284457e-03 0.000000e+00 1.271309e-04 1.0000000 2.417195e-04
## indus   1.105960e-04 1.672775e-06 0.000000e+00 1.0000000 9.560598e-08
## chas    4.167137e-01 9.742994e-01 7.988462e-01 0.0000000 1.000000e+00
## nox     1.550552e-04 3.357216e-06 1.074225e-09 0.9200326 0.000000e+00
## rm      2.431550e-03 2.434360e-03 3.886776e-04 0.4966148 1.649605e-03
## age     6.401623e-04 3.007306e-07 1.054161e-07 0.9626125 5.329690e-09
## dis     5.756140e-04 2.863417e-07 5.617665e-08 0.9200861 2.056115e-09
## rad     1.066445e-06 2.865584e-04 1.712863e-06 0.5139848 4.791323e-06
## tax     2.762339e-06 1.722141e-04 1.424366e-07 0.5004301 1.058846e-06
## ptratio 1.738565e-03 8.983828e-04 7.449341e-04 0.2954829 4.565702e-03
## black   6.091967e-05 8.087550e-03 3.260073e-04 0.6344478 3.329302e-04
## lstat   5.858634e-05 7.173685e-05 8.898971e-07 0.5508139 4.162472e-06
## medv    1.706341e-04 8.170349e-04 4.791024e-05 0.3461764 2.671186e-04
##                   rm          age          dis          rad          tax
## crim    5.460004e-02 2.304584e-02 2.129772e-02 8.259000e-05 2.016508e-04
## zn      5.460004e-02 2.405844e-05 2.319368e-05 1.318169e-02 8.531706e-03
## indus   1.554711e-02 8.854952e-06 4.775015e-06 1.284648e-04 1.167980e-05
## chas    1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## nox     4.948816e-02 4.636831e-07 1.809382e-07 3.353926e-04 8.259000e-05
## rm      0.000000e+00 5.460004e-02 1.170387e-01 5.132770e-02 2.777919e-02
## age     2.275002e-03 0.000000e+00 8.982624e-08 4.066218e-03 1.841719e-03
## dis     6.665341e-03 9.980694e-10 0.000000e+00 2.689767e-03 1.433697e-03
## rad     1.974142e-03 7.530033e-05 4.482945e-05 0.000000e+00 6.702657e-11
## tax     8.187115e-04 2.877686e-05 2.166334e-05 7.365557e-13 0.000000e+00
## ptratio 2.925326e-04 4.236675e-03 6.502152e-03 3.354635e-04 2.702118e-04
## black   1.717361e-02 1.503185e-03 1.809994e-03 2.139846e-05 2.902442e-05
## lstat   2.721164e-06 7.283252e-06 6.071380e-05 2.338660e-05 5.374022e-06
## medv    1.140787e-07 4.559715e-04 2.010551e-03 1.193301e-04 3.677290e-05
##              ptratio        black        lstat         medv
## crim    4.948816e-02 0.0034606865 3.398008e-03 8.531706e-03
## zn      2.874825e-02 0.1264815268 3.945527e-03 2.777919e-02
## indus   2.607269e-02 0.0142572745 7.030187e-05 2.826704e-03
## chas    1.000000e+00 1.0000000000 1.000000e+00 1.000000e+00
## nox     8.674833e-02 0.0142572745 2.955355e-04 1.282169e-02
## rm      1.318169e-02 0.2404305720 2.013662e-04 9.468528e-06
## age     8.473349e-02 0.0465987475 4.952611e-04 1.778289e-02
## dis     1.170387e-01 0.0494881609 3.460687e-03 5.132770e-02
## rad     1.425727e-02 0.0014336967 1.520129e-03 6.205164e-03
## tax     1.282169e-02 0.0018417188 3.708075e-04 2.243147e-03
## ptratio 0.000000e+00 0.1264815268 1.425727e-02 1.841719e-03
## black   7.905095e-03 0.0000000000 1.812792e-02 4.948816e-02
## lstat   3.240290e-04 0.0004770506 0.000000e+00 1.908560e-06
## medv    2.901205e-05 0.0016797274 2.219256e-08 0.000000e+00
corrplot(cor_matrix,p.mat = cor_test_mat,insig = "p-value")

This plot shows the correlation matrix. The area of each circle reflects the absolute value of the corresponding correlation coefficient: a lighter shade means weaker correlation (white is no correlation), darker blue means stronger positive correlation, and darker red means stronger negative correlation. The numbers inside the cells are p-values, printed only in cells where the correlation coefficient is not significant. We see both positive and negative correlations among several variables with significant p-values: dis correlates strongly negatively with indus, nox, and age, as does lstat with medv, while the strongest positive correlation is between rad and tax.

# pairwise scatter plots and distributions of the raw variables
p <- ggpairs(Boston, mapping = aes(),
             lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
p

We see bimodal distributions for several variables (for example indus, rad, and tax). Variables such as crim, zn, and dis are clearly skewed right, while black is skewed left.

# Distribution can be viewed also like this:
# Note this part added later for improvement (after the assignment deadline was over) 
ggplot(data = melt(Boston), aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")
## Using  as id variables

Here is the summary of the dataset

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

Standardizing data, creating a categorical variable and forming a train and test set

# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim) #required for next step
# summary of the scaled crime rate
summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
# create a categorical variable of the crime rate in the Boston dataset; use the quantiles as the break points
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
table(crime)
## crime
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##             127             126             126             127
#Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
levels(boston_scaled$crime) <- c("low", "med_low", "med_high", "high") # rename the quantile intervals; factor(..., levels = ...) with these new names would turn every value into NA

#Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

By scaling, we have subtracted the column mean from each column and divided the difference by the column's standard deviation. \[scaled(x) = \frac{x - mean(x)}{sd(x)}\] As a result, in the scaled output each variable has mean 0; the minimum and 1st quartile are negative, while the 3rd quartile and maximum are positive (the median may fall on either side of zero).
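A minimal sketch verifying that scale() implements exactly this formula, using the crim column as an example:

```r
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)          # the formula above, applied by hand
all.equal(as.numeric(scale(x)), manual)  # TRUE: scale() centers and divides by sd
```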

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)

Fit linear discriminant analysis on train set

# linear discriminant analysis on train set
#I am loading boston_scaled from the course repository here because the lda() function produced a warning with my previous boston_scaled data frame.

boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]

lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2301980 0.2450495 0.2722772 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.02123632 -0.9072611 -0.15653200 -0.8872605  0.4315419 -0.8592762
## med_low  -0.09881021 -0.2554344 -0.06065701 -0.5357652 -0.1342905 -0.2815545
## med_high -0.38242941  0.1319232  0.12535782  0.3053716  0.1602694  0.3189338
## high     -0.48724019  1.0169558 -0.02178633  1.0686460 -0.4610138  0.8126151
##                 dis        rad        tax    ptratio       black       lstat
## low       0.9021010 -0.6846209 -0.7340671 -0.4341205  0.38382727 -0.75061951
## med_low   0.3490730 -0.5570619 -0.4744850 -0.0430348  0.32100583 -0.13859703
## med_high -0.3529831 -0.4018374 -0.3315206 -0.2687348  0.07302245 -0.05396767
## high     -0.8586216  1.6397657  1.5152267  0.7826832 -0.72481360  0.87629325
##                 medv
## low       0.51034154
## med_low  -0.01292014
## med_high  0.22619348
## high     -0.66375507
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.071863482  0.86242861 -0.84833398
## indus   -0.019259668 -0.21520528  0.31066572
## chas    -0.058390920 -0.06353660  0.09584073
## nox      0.444288434 -0.68013671 -1.44449341
## rm      -0.116720317 -0.10308821 -0.14552478
## age      0.203697174 -0.16323830  0.04514572
## dis     -0.095736488 -0.22202915  0.32210329
## rad      3.167080140  0.91161969 -0.01832820
## tax      0.005680403 -0.03881713  0.57208000
## ptratio  0.105576691  0.05973318 -0.22974536
## black   -0.100341892  0.01610954  0.13227131
## lstat    0.225974255 -0.28091242  0.29854009
## medv     0.189728166 -0.40593418 -0.24607747
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9546 0.0350 0.0105
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric; plot() needs numeric values for col and pch
classes <- as.numeric(train$crime)

# Draw the LDA (bi)plot
plot(lda.fit, dimen = 2,  col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

Predict the classes with the LDA model on the test data

test <- boston_scaled[-ind,]
# save the crime categories from the test set
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      12        1    0
##   med_low    7      21        5    0
##   med_high   0       8       18    1
##   high       0       0        0   17
plot(table(correct = correct_classes , predicted = lda.pred$class))

# proportion of wrong predictions in the test data
mean(correct_classes != lda.pred$class)
## [1] 0.3333333

Here is the test error rate:

error <- mean(correct_classes != lda.pred$class)
errorrate <- round(error*100, digits = 2)
paste0("Test error rate is ", errorrate, "%")
## [1] "Test error rate is 33.33%"

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

Investigate the optimal number of clusters for kmeans

library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') + scale_x_continuous(breaks=seq(1, 10, 1))

The total WCSS drops most sharply when moving from one cluster to two, so two clusters would seem optimal.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

This is hard to interpret. Let's try different ways to look at the clusters: 1. dividing the dataset into subsets of a few columns each, and 2. plotting individual pairs of variables separately (for now, only two pairs that seem to be correlated).

pairs(boston_scaled[, 1:7], col = km$cluster)

pairs(boston_scaled[, 8:14], col = km$cluster)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%()      masks ggplot2::%+%()
## ✖ psych::alpha()    masks ggplot2::alpha()
## ✖ tidyr::expand()   masks reshape::expand()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ reshape::rename() masks dplyr::rename()
## ✖ MASS::select()    masks dplyr::select()
clusters <- factor(km$cluster)
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled %>% ggplot(aes(x = indus, y = nox, col = clusters)) +
  geom_point()

boston_scaled <- as.data.frame(boston_scaled)
boston_scaled %>% ggplot(aes(x =lstat, y = medv, col = clusters)) +
  geom_point()

#ggpairs(boston_scaled, mapping = aes(col = clusters), lower = list(combo = wrap("facethist", bins = 20))) + theme_bw() 
#commented because it is still crowded figure.

We see a positive correlation between indus and nox in cluster 1 but not in cluster 2. There is a negative correlation between lstat and medv in both clusters.
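These cluster-specific correlations can also be checked numerically. A minimal sketch, assuming the boston_scaled data frame and the clusters factor created above:

```r
# correlation of each variable pair within each k-means cluster
by(boston_scaled[, c("indus", "nox")], clusters, function(d) cor(d$indus, d$nox))
by(boston_scaled[, c("lstat", "medv")], clusters, function(d) cor(d$lstat, d$medv))
```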

date()
## [1] "Fri Dec  2 13:44:45 2022"

ASSIGNMENT 5: DIMENSIONALITY REDUCTION TECHNIQUES

Data wrangling

Data wrangling part completed and saved by the name create_human_assignment5.R in data folder. Human data saved as human.txt.

Data analysis

Loading the data both from my data wrangling output and from the course URL; only one of them is used for the analysis.

human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt",sep =",", header = T)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
human2 <- read.table("data/human.txt")
dim(human2)
## [1] 155   8
str(human2)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
#they are the same (only the column order differs slightly)

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Graphical overview and summary

library(GGally)
library(ggplot2)
p <- ggpairs(human, mapping = aes(), lower = list(combo=wrap("facethist", bins=20)))+theme_bw()
p

Correlations and distributions can also be visualized separately.

# Correlation
library(corrplot)
cor_mat <- cor(human)
corrplot.mixed(cor_mat)

There is a significant positive correlation of Edu2.FM with Edu.Exp, Life.Exp and GNI, and a significant negative correlation of Edu2.FM with Mat.Mor and Ado.Birth. Labo.FM correlates positively with Mat.Mor and Parli.F. Edu.Exp shows significant positive correlations with Life.Exp, GNI and Parli.F, and significant negative correlations with Mat.Mor and Ado.Birth. GNI shows significant negative correlations with Mat.Mor and Ado.Birth. Finally, Mat.Mor and Ado.Birth are significantly positively correlated with each other.
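The word "significant" can also be checked formally. A sketch, assuming corrplot is loaded and `cor_mat` is the correlation matrix computed above: `cor.mtest()` returns p-values for each pairwise correlation, and `insig = "blank"` hides the non-significant ones from the plot.

```r
# p-values for all pairwise correlations at the 95% confidence level
p_mat <- cor.mtest(human, conf.level = 0.95)
# redraw the mixed correlation plot, blanking out cells with p >= 0.05
corrplot.mixed(cor_mat, p.mat = p_mat$p, sig.level = 0.05, insig = "blank")
```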

# Distribution
library(reshape)
ggplot(data = melt(human), aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")
## Using  as id variables

Edu.Exp is normally distributed. Edu2.FM is slightly skewed to the left but otherwise close to normal. Labo.FM and Life.Exp are skewed left. GNI, Mat.Mor and Ado.Birth are skewed right, and Parli.F is also somewhat skewed right.
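The eyeballed skewness can be quantified with the sample skewness (the third standardized moment); a base-R sketch with no extra packages, using a hypothetical helper `skew`:

```r
# sample skewness: positive = right-skewed, negative = left-skewed,
# near zero = roughly symmetric
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(human, skew)
```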

Here is the summary of the data:

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

Principal component analysis (PCA) on the raw (non-standardized) human data

pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
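The variability captured by the principal components can be shown with `summary()`; a sketch, assuming `pca_human` from the chunk above (the PCA on the raw data) is still in memory:

```r
# proportion of variance (in %) captured by each principal component
s <- summary(pca_human)
round(100 * s$importance["Proportion of Variance", ], digits = 1)
```

With the raw data, the variable with the largest variance (here GNI, whose scale dwarfs the others) is expected to dominate the first component.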

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

PCA on the standardized variables in the human data

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
# draw a biplot of the principal component representation and the original variables

pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col = c("grey40", "deeppink2"))

We know that by scaling we subtract the column mean from each column and divide the difference by the standard deviation. This brings the mean of each variable to zero and its variance to one. The effect is clear when we compare the two biplots. With the unscaled data we get the warning "Warning: zero-length arrow is of indeterminate angle and so skipped", so we see only one arrow, for GNI.

In the loading plot, the orientation (direction) of a vector with respect to the principal component space, in particular its angle with the principal component axes, defines its contribution to each principal component: the more parallel a vector is to a principal component axis, the more it contributes to that PC alone. In the unscaled case, where only GNI is visible, GNI is parallel to the PC1 axis, so GNI has a strong influence on PC1.

For the scaled data, since zero is at the middle of both PCs, all variables are labelled in the plot. Parli.F and Labo.FM strongly influence PC2, while Edu.Exp, GNI, Edu2.FM, Life.Exp, Mat.Mor and Ado.Birth have a strong say in PC1. The result for GNI is unchanged between the scaled and unscaled data. Scaling also seems to affect how the clusters of countries look based on these variables. In the loading plot, the angles between the vectors of different variables show their correlations in this space: small angles represent high positive correlation, right angles represent lack of correlation, and opposite angles represent high negative correlation. With the unscaled data there is only one arrow, so this cannot be interpreted, but with the scaled data it can. High positive correlation is observed between Labo.FM and Parli.F, and between Mat.Mor and Ado.Birth. There is also high positive correlation among Edu.Exp, GNI, Life.Exp and Edu2.FM.
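The visual reading of the biplot can be backed up numerically by inspecting the loadings; a sketch, assuming `pca_human` is the PCA of the standardized data from the chunk above:

```r
# loadings (rotation matrix) of the first two principal components;
# large absolute values mark the variables driving each component
round(pca_human$rotation[, 1:2], 2)
```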

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

The loading plot in the above biplot can also be interpreted via the lengths of the vectors: the longer the vector, the more of that variable's variability is represented by the two displayed principal components; short vectors are thus better represented in other dimensions. Based on the vector lengths in the scaled plot, PC1 and PC2 seem to mostly represent variability in expected years of education, life expectancy, maternal mortality ratio and perhaps also the ratio of female to male populations with secondary education.

The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions). Load the tea dataset and convert its character variables to factors: Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.

Load and explore tea dataset

library(dplyr)
library(tidyr)
library(ggplot2)

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
#View(tea) # commented out because it opens a new tab to view the data every time I run the code, which is kind of annoying

Visualize the data set

# visualize the dataset
library(dplyr)
library(tidyr)
# including only some column names to keep in the dataset

#tea_time <- tea %>% select(Tea, How, how, sugar, where, lunch) # somehow index.Rmd gives an error with this code

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, all_of(keep_columns)) # all_of() avoids the tidyselect warning about external vectors
# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") +geom_bar() +theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Most people seem to prefer tea bags. Most also do not drink their tea at lunch time and do not take lemon, milk or anything else in their tea. There are almost equal numbers of people preferring sugar and no sugar. They mostly consume Earl Grey, bought at chain stores.

Multiple Correspondence Analysis (MCA)

Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

In the final plot, the different "ind" categories coming from the same column of the data share the same color.

library(FactoMineR)
#res <- MCA(tea, quanti.sup=19, quali.sup=c(20:36), graph = F)

This code takes a very long time even when the graphs are not plotted, so I will also run the same analysis on the smaller dataset "tea_time" created with fewer columns.

Multiple Correspondence Analysis (MCA) on the selected columns of tea data

# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Visualization of MCA on the selected columns of tea data

# visualize MCA, I have plotted the variable biplot as well as the individual biplot
plot.MCA(mca, invisible=c("var","quali.sup"), cex=0.7)

plot(mca, invisible=c("ind","quali.sup"), cex=0.7)

plot(mca, invisible=c("ind"))

The scatterplots are quite homogeneous, with some extreme points such as "unpackaged" and "tea shop" in the individual plot.

# visualize MCA, I have plotted the variable biplot as well as the individual biplot
plot(mca, graph.type = "classic")

plot(mca, invisible=c("var"), graph.type = "classic")

plot(mca, invisible=c("ind"), graph.type = "classic")

plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

dimdesc(mca)
## $`Dim 1`
## 
## Link between the variable and the categorical variable (1-way anova)
## =============================================
##               R2      p.value
## how   0.70769782 4.743070e-80
## where 0.70169065 9.728145e-79
## Tea   0.12597089 2.072984e-09
## sugar 0.06510460 7.631229e-06
## How   0.07575531 3.401438e-05
## 
## Link between variable abd the categories of the categorical variables
## ================================================================
##                              Estimate      p.value
## how=unpackaged              0.7352354 9.296592e-50
## where=tea shop              0.7755232 1.016669e-49
## Tea=black                   0.1275555 1.994908e-06
## sugar=No.sugar              0.1349392 7.631229e-06
## How=lemon                   0.2740418 3.863682e-05
## Tea=green                   0.1343915 3.013495e-03
## How=milk                   -0.2576104 2.550801e-03
## how=tea bag+unpackaged     -0.1144382 3.529149e-05
## sugar=sugar                -0.1349392 7.631229e-06
## where=chain store+tea shop -0.1273957 1.660721e-06
## Tea=Earl Grey              -0.2619470 2.472466e-10
## how=tea bag                -0.6207973 1.180088e-44
## where=chain store          -0.6481275 1.340115e-45
## 
## $`Dim 2`
## 
## Link between the variable and the categorical variable (1-way anova)
## =============================================
##               R2      p.value
## where 0.68148468 1.640037e-74
## how   0.52173497 2.696444e-48
## How   0.19031574 1.641412e-13
## Tea   0.10764680 4.515572e-08
## lunch 0.06362761 9.756757e-06
## 
## Link between variable abd the categories of the categorical variables
## ================================================================
##                               Estimate      p.value
## where=chain store+tea shop  0.71592830 1.221027e-64
## how=tea bag+unpackaged      0.58116824 5.608738e-44
## How=other                   0.62818710 1.424535e-08
## lunch=lunch                 0.18210707 9.756757e-06
## Tea=Earl Grey               0.18485576 3.974787e-03
## How=milk                   -0.16271132 1.519564e-02
## How=lemon                  -0.03039274 1.168288e-03
## lunch=Not.lunch            -0.18210707 9.756757e-06
## Tea=green                  -0.35458225 6.009202e-09
## How=alone                  -0.43508305 2.039306e-10
## how=unpackaged             -0.46020270 1.920895e-11
## where=tea shop             -0.57260797 8.006916e-13
## how=tea bag                -0.12096554 4.851713e-13
## where=chain store          -0.14332033 5.610381e-18
## 
## $`Dim 3`
## 
## Link between the variable and the categorical variable (1-way anova)
## =============================================
##               R2      p.value
## Tea   0.40989341 9.619982e-35
## How   0.39383806 5.776849e-32
## sugar 0.33619389 2.412708e-28
## lunch 0.11110184 3.233374e-09
## where 0.05454983 2.411784e-04
## 
## Link between variable abd the categories of the categorical variables
## ================================================================
##                      Estimate      p.value
## Tea=Earl Grey      0.32078574 1.605342e-28
## sugar=sugar        0.27170087 2.412708e-28
## How=lemon          0.77415042 1.104740e-17
## lunch=lunch        0.22062782 3.233374e-09
## where=tea shop     0.22015318 1.043046e-04
## How=alone          0.09880062 7.713644e-03
## where=chain store -0.14933787 4.568451e-03
## lunch=Not.lunch   -0.22062782 3.233374e-09
## How=other         -1.03063448 6.429911e-16
## sugar=No.sugar    -0.27170087 2.412708e-28
## Tea=black         -0.38804124 4.898283e-33

I found it quite impossible to say anything based on the first graphs. Most data points lie around the center, and the plots seem pretty homogeneous with some extreme points. So I am also using the dimdesc() function, which helps with the interpretation of MCA.

The first principal dimension is characterized by the variables "how" and "where", followed by "Tea", "sugar" and "How". Characterization by categories is similar to characterization by variables, but it allows more precision. For example, in Dim 1 the coordinate of the category "How=lemon" is positive whereas that of "How=milk" is negative, which means that individuals whose coordinate is positive tend to take their tea with lemon. Another example in Dim 1 is sugar (negative coordinate) versus no sugar (positive coordinate): individuals whose coordinate is positive tend to take their tea without sugar.
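The share of total inertia captured by each MCA dimension (15.2% and 14.2% for the first two, per the eigenvalue table above) can also be visualized as a scree plot; a sketch, assuming the `mca` object from the analysis above is in memory:

```r
# mca$eig holds, per dimension, the eigenvalue, the percentage of
# variance and the cumulative percentage; plot the percentages
barplot(mca$eig[, "percentage of variance"],
        names.arg = seq_len(nrow(mca$eig)),
        xlab = "MCA dimension", ylab = "% of variance")
```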

(more chapters to be added similarly as we proceed with the course!)